Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[amdgpu] Part3 update runtime module #6486

Merged
merged 12 commits into from
Dec 30, 2022

Conversation

galeselee
Copy link
Contributor

@galeselee galeselee commented Oct 31, 2022

Issue: #6434

Brief Summary

  1. This is a special part of the Tacihi runtime module for the AMDGPU backend. Tacihi's runtime module uses clang++ to generate LLVM IR is different in memory allocation differs from the cpu-generated LLVM IR. The following is an example.
C/C++ code
void func(int *a, int *b) {
    *a = *b;
}
x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}
__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) #4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}
__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}
  1. There are some differences in the place about allocainst, specifically about addrspace (for AMDGPU, this will be helpful). I have not found documentation describing how to write the correct LLVM IR on AMDGPU, through my observation of the LLVM IR generated by clang++/hipcc. We need to deal with the arguments of the __global__ function and the allocainst (including specifying the addrspace of allocainst and performing addrspace-cast) while for the __device__ function we do not need to deal with the arguments of the function.

@netlify
Copy link

netlify bot commented Oct 31, 2022

Deploy Preview for docsite-preview ready!

Name Link
🔨 Latest commit c015143
🔍 Latest deploy log https://app.netlify.com/sites/docsite-preview/deploys/63a5c7c38a18f0000834b948
😎 Deploy Preview https://deploy-preview-6486--docsite-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

Copy link
Contributor

@jim19930609 jim19930609 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, for hardware-specifc code or configurations, we should comment a reference link to a specific chapter/part of AMD-GPU-guide whenever we use some magic strings or magic numbers. This is especially true for setting up function attributes or so.

The previously implemented CUDA backend is filled with magic/hack and lacks enough comments/explanations. Let's improve this situation starting from AMD GPU. Thanks!

taichi/runtime/llvm/llvm_runtime_executor.cpp Show resolved Hide resolved
taichi/runtime/llvm/llvm_context.cpp Outdated Show resolved Hide resolved
taichi/runtime/llvm/llvm_context.cpp Outdated Show resolved Hide resolved
taichi/runtime/llvm/llvm_context.cpp Outdated Show resolved Hide resolved
taichi/runtime/llvm/llvm_context.cpp Outdated Show resolved Hide resolved
taichi/runtime/llvm/llvm_context.cpp Outdated Show resolved Hide resolved
@galeselee galeselee closed this Nov 14, 2022
@galeselee galeselee reopened this Dec 2, 2022
@galeselee galeselee marked this pull request as draft December 2, 2022 07:49
@galeselee galeselee marked this pull request as ready for review December 22, 2022 13:20
Copy link
Contributor

@jim19930609 jim19930609 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@galeselee galeselee merged commit d321d12 into taichi-dev:master Dec 30, 2022
feisuzhu pushed a commit to feisuzhu/taichi that referenced this pull request Jan 5, 2023
Issue: taichi-dev#6434

### Brief Summary
1. This is a special part of the Tacihi runtime module for the `AMDGPU`
backend. Tacihi's runtime module uses `clang++` to generate `LLVM IR` is
different in memory allocation differs from the cpu-generated `LLVM IR`.
The following is an example.
```
C/C++ code
void func(int *a, int *b) {
    *a = *b;
}
x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) taichi-dev#2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}
__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) taichi-dev#4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}
__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) taichi-dev#2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}
```
2. There are some differences in the place about `allocainst`,
specifically about addrspace (for `AMDGPU`,
[this](https://llvm.org/docs/AMDGPUUsage.html#address-spaces) will be
helpful). I have not found documentation describing how to write the
correct `LLVM IR` on `AMDGPU`, through my observation of the `LLVM IR`
generated by `clang++/hipcc`. We need to deal with the arguments of the
`__global__` function and the `allocainst` (including specifying the
addrspace of `allocainst` and performing addrspace-cast) while for the
`__device__` function we do not need to deal with the arguments of the
function.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
galeselee added a commit that referenced this pull request Jan 6, 2023
Issue: ##6434

### Brief Summary
These unit tests are for #6486

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this pull request May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. This is a special part of the Tacihi runtime module for the `AMDGPU`
backend. Tacihi's runtime module uses `clang++` to generate `LLVM IR` is
different in memory allocation differs from the cpu-generated `LLVM IR`.
The following is an example.
```
C/C++ code
void func(int *a, int *b) {
    *a = *b;
}
x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) taichi-dev#2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}
__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) taichi-dev#4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}
__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) taichi-dev#2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}
```
2. There are some differences in the place about `allocainst`,
specifically about addrspace (for `AMDGPU`,
[this](https://llvm.org/docs/AMDGPUUsage.html#address-spaces) will be
helpful). I have not found documentation describing how to write the
correct `LLVM IR` on `AMDGPU`, through my observation of the `LLVM IR`
generated by `clang++/hipcc`. We need to deal with the arguments of the
`__global__` function and the `allocainst` (including specifying the
addrspace of `allocainst` and performing addrspace-cast) while for the
`__device__` function we do not need to deal with the arguments of the
function.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this pull request May 13, 2023
…7023)

Issue: #taichi-dev#6434

### Brief Summary
These unit tests are for taichi-dev#6486

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants